The CAZyme prediction tools (classifiers) dbCAN, CUPP and eCAMI were independently evaluated against a high quality benchmark test set. Performance was evaluated on CAZyme/non-CAZyme differentiation, multilabel CAZy class classification, and multilabel classification of CAZy family annotations.

Results summary:
- dbCAN and DIAMOND showed the strongest performances in CAZyme/non-CAZyme differentiation
- dbCAN was the strongest performing tool across all categories; Hotpep (a tool invoked by dbCAN) was the weakest
- The performances of CUPP and eCAMI were similar, although CUPP performed marginally better on the multilabel classification of CAZy family annotations
- The performance of dbCAN may be optimised by substituting Hotpep with CUPP and/or eCAMI

1 Introduction

The CAZyme prediction tools (classifiers) dbCAN (Zhang et al. 2018), CUPP (Barrett and Lange, 2019) and eCAMI (Xu et al. 2019) use different methods to predict whether a protein is a CAZyme or non-CAZyme, and to predict the CAZy family annotations of predicted CAZymes. These classifiers have not been independently, comprehensively or reproducibly evaluated against a high quality benchmark test set.

The Python package pyrewton was used to create the test sets for the evaluation, invoke the CAZyme classifiers, and perform the statistical evaluations of the performances (using the sklearn library).

This notebook lays out the independent, reproducible and comprehensive evaluation of dbCAN, CUPP and eCAMI against a high quality benchmark test set. The tools were evaluated at three levels of CAZyme classification: CAZyme/non-CAZyme, CAZy class and CAZy family classification. Specifically, this evaluation assesses the performance of:
- Binary CAZyme/non-CAZyme classification
- Multilabel classification of CAZy class annotations
- Binary classification of each CAZy class, independent of all other CAZy classes
- Multilabel classification of CAZy family annotations
- Binary classification of each CAZy family, independent of all other CAZy families

dbCAN incorporates the three protein function classifiers HMMER (Potter et al. 2018), Hotpep (Busk et al. 2017) and DIAMOND (Buchfink et al. 2015). In order to comprehensively evaluate the performance of dbCAN, the predictions from HMMER, Hotpep and DIAMOND were evaluated independently of each other, and the consensus prediction (a prediction which at least two of the tools agree upon) was defined as the dbCAN result.
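The two-of-three consensus rule can be sketched as follows (a minimal illustration of the voting logic described above, not dbCAN's actual implementation):

```python
def dbcan_consensus(hmmer: bool, hotpep: bool, diamond: bool) -> bool:
    """Return the dbCAN consensus CAZyme call: positive when at least
    two of the three constituent tools predict the protein is a CAZyme."""
    return (hmmer + hotpep + diamond) >= 2  # bools sum as 0/1
```

For example, if HMMER and DIAMOND call a protein a CAZyme but Hotpep does not, the consensus (dbCAN) result is still a CAZyme prediction.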

2 Test sets

70 test sets, each containing 100 CAZymes and 100 non-CAZymes, were used in the evaluation. The CAZyme classifiers parsed the same 70 test sets.

Each test set was created from a unique genomic assembly. From each genomic assembly, 100 CAZymes were selected at random, and the 100 non-CAZymes with the highest sequence similarity to the selected CAZymes were included in the test set. Choosing the 100 non-CAZymes with the highest sequence similarity increased the probability of causing confusion. The performance of the CAZyme classifiers was therefore evaluated against test sets designed to cause the classifiers the greatest confusion, producing a baseline of each classifier's performance and avoiding an overoptimistic evaluation. An equal number of CAZymes and non-CAZymes was selected to prevent over-representation of one population over the other.
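The selection procedure above can be sketched as follows. This is an illustrative simplification, not pyrewton's code: `similarity(a, b)` stands in for a hypothetical pairwise sequence-similarity lookup (e.g. a precomputed BLAST bit-score), and the function names are assumptions.

```python
import random


def build_test_set(cazymes, non_cazymes, similarity, n=100, seed=42):
    """Sketch of the test-set construction: sample n CAZymes at random,
    then keep the n non-CAZymes most similar to the sampled CAZymes.

    `similarity(a, b)` is a hypothetical pairwise similarity function.
    """
    rng = random.Random(seed)
    pos = rng.sample(list(cazymes), n)
    # Rank each non-CAZyme by its best similarity to any sampled CAZyme,
    # and keep the n most confusable negatives.
    neg = sorted(
        non_cazymes,
        key=lambda s: max(similarity(s, c) for c in pos),
        reverse=True,
    )[:n]
    return pos, neg
```

Equal-sized positive and negative samples keep the two populations balanced, as described above.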

For a genomic assembly to be used to create a test set, the assembly had to meet all of the following criteria:

  • Contains at least 100 CAZymes
  • Contains at least 100 non-CAZymes
  • Has an ‘Assembly level’ of ‘Complete Genome’ in the NCBI Assembly database
  • Protein records are still present in NCBI
  • Not listed as an ‘Anomalous assembly’ in the NCBI Assembly database

The genomic assemblies were also chosen from a range of taxonomies, to provide as comprehensive an understanding as possible of the performance of the classifiers across a range of datasets.

Table 2.1 contains the genomic assemblies used to create the test sets for the evaluation. In total 70 assemblies were chosen:

  • 1 Oomycete
  • 19 Fungi
  • 9 Yeast
  • 2 Eukaryotes
  • 20 Gram Positive Bacteria
  • 19 Gram Negative Bacteria
Table 2.1: Genomic assemblies selected for creation of high quality benchmarking test sets for the evaluation of the CAZyme classifiers dbCAN, CUPP and eCAMI

| Phylogeny | Organism | Strain | NCBI Taxonomy ID | GenBank Assembly Accession | Number of CAZymes in CAZy |
|---|---|---|---|---|---|
| Oomycetes | Dictyoglomus turgidum | DSM 6724 | 515635 | GCA_000021645.1 | 101 |
| Fungi (Ascomycetes) | Aspergillus flavus | NRRL 3357 | 227321 | GCA_009017415.1 | 441 |
| | Aspergillus chevalieri | M1 | 182096 | GCA_016861735.1 | 521 |
| | Metarhizium brunneum | 4556 | 500148 | GCA_013426205.1 | 394 |
| | Peltaster fructicola (ascomycetes) | LNHT1506 | 403677 | GCA_001592805.2 | 267 |
| | Penicillium digitatum (ascomycetes) | PdW03 | 36651 | GCA_016767815.1 | 318 |
| | Micromonas commoda (green algae) | RCC299 | 296587 | GCA_000090985.2 | 148 |
| | Yarrowia lipolytica | DSM 3286 | 4652 | GCA_014490615.1 | 133 |
| | Botrytis cinerea | B05.10 | 332648 | GCA_000143535.4 | 341 |
| | Eremothecium gossypii | ATCC 10895 | 284811 | GCA_000091025.4 | 108 |
| | Kluyveromyces lactis | CBS 2105 | 28985 | GCA_007993695.1 | 242 |
| | Kluyveromyces lactis | NRRL Y-1140 | 284590 | GCA_000002515.1 | 118 |
| | Kluyveromyces marxianus | CBS4857 | 4911 | GCA_001854445.2 | 123 |
| | Pyricularia oryzae | | 318829 | GCA_004346965.1 | 550 |
| | Sugiyamaella lignohabitans | CBS 10342 | 796027 | GCA_001640025.2 | 150 |
| | Fusarium culmorum | Class2-1B | 5516 | GCA_016952355.1 | 486 |
| | Fusarium oxysporum | Fo47 | 660027 | GCA_013085055.1 | 719 |
| | Fusarium pseudograminearum | Class2-1C | 101028 | GCA_016952305.1 | 485 |
| | Cordyceps militaris | ATCC 34164 | 73501 | GCA_008080495.1 | 319 |
| | Clavispora lusitaniae | P5 | 36911 | GCA_009498115.1 | 135 |
| Yeast | Brettanomyces bruxellensis | UCD 2041 | 5007 | GCA_011074885.2 | 140 |
| | Pichia kudriavzevii | CBS573 | 4909 | GCA_003054445.1 | 137 |
| | Brettanomyces nanus | CBS 1945 | 13502 | GCA_011074865.2 | 140 |
| | Metschnikowia aff. pulcherrima (budding yeasts) | APC 1.2 | 2163413 | GCA_004217705.1 | 136 |
| | Zygosaccharomyces parabailii | ATCC 60483 | 1365886 | GCA_001984395.2 | 220 |
| | [Candida] glabrata | BG2 | 5478 | GCA_014217725.1 | 146 |
| | [Candida] auris | B11220 | 498019 | GCA_003013715.2 | 131 |
| | [Candida] auris | B11245 | 498019 | GCA_008275145.1 | 131 |
| | Candida dubliniensis | CD36 | 573826 | GCA_000026945.1 | 140 |
| Eukaryote | Ostreococcus lucimarinus | CCE9901 | 436017 | GCA_000092065.1 | 115 |
| | Chloropicon primus | CCMP1205 | 1764295 | GCA_007859695.1 | 221 |
| Gram Positive Bacteria | Hungateiclostridium thermocellum | ATCC 27405 | 203119 | GCA_000015865.1 | 144 |
| | Hungateiclostridium clariflavum | DSM 19732 | 720554 | GCA_000237085.1 | 147 |
| | Alicyclobacillus sp. | SO9 | 2665646 | GCA_016406125.1 | 113 |
| | Bacillus altitudinis | 11-1-1 | 293387 | GCA_013283915.1 | 100 |
| | Bacillus amyloliquefaciens | MOH1-5b | 1039 | GCA_014792065.1 | 102 |
| | Bacillus amyloliquefaciens | KHG19 | 1292358 | GCA_000835145.1 | 101 |
| | Dickeya chrysanthemi | Ech1591 | 561229 | GCA_000023565.1 | 108 |
| | Dickeya dianthicola | ME23 | 1940567 | GCA_003403135.1 | 116 |
| | Enterococcus faecium | isolate 2014-VREF-268 | 1352 | GCA_002025045.1 | 104 |
| | Enterococcus casseliflavus | EC291 | 37734 | GCA_009707345.1 | 145 |
| | Clostridium saccharoperbutylacetonicum | N1-504 | 36745 | GCA_002003305.1 | 214 |
| | Clostridium beijerinckii | NCIMB 14988 | 1520 | GCA_000833105.2 | 193 |
| | Ruminiclostridium cellulolyticum | H10 | 394503 | GCA_000022065.1 | 144 |
| | Streptomyces bingchenggensis | BCW-1 | 749414 | GCA_000092385.1 | 387 |
| | Streptomyces sporoclivatus | NBRC 100767 | 284038 | GCA_009936315.1 | 361 |
| | Schleiferilactobacillus harbinensis | NSMJ42 | 304207 | GCA_008694105.1 | 153 |
| | Streptacidiphilus sp. | P02-A3a | 2704468 | GCA_014084105.1 | 288 |
| | Streptosporangium roseum | DSM 43021 | 479432 | GCA_000024865.1 | 254 |
| | Nocardia arthritidis | AUSMDU00012717 | 228602 | GCA_011801145.1 | 189 |
| | Mycobacterium sp. | JS623 | 212767 | GCA_000328565.1 | 136 |
| Gram Negative Bacteria | Actinobacillus equuli | NCTC9435 | 718 | GCA_900638075.1 | 117 |
| | Azospirillum brasilense | Sp 7 | 192 | GCA_001315015.1 | 161 |
| | Caulobacter segnis | ATCC 21756 | 509190 | GCA_000092285.1 | 116 |
| | Cellvibrio japonicus | Ueda107 | 498211 | GCA_000019225.1 | 222 |
| | Enterobacter asburiae | CAV1043 | 61645 | GCA_003940765.1 | 205 |
| | Escherichia coli | 142 | 562 | GCA_005221905.1 | 100 |
| | Escherichia coli | 144 | 562 | GCA_005221585.1 | 282 |
| | Klebsiella aerogenes | 035 | 548 | GCA_011604725.1 | 110 |
| | Klebsiella michiganensis | BD177 | 1134687 | GCA_010093005.1 | 157 |
| | Klebsiella oxytoca | KONIH4 | 571 | GCA_002906395.1 | 162 |
| | Pseudobacter ginsenosidimutans | Gsoil 221 | 661488 | GCA_007970185.1 | 233 |
| | Pseudomonas cerasi | | 1583341 | GCA_900074915.1 | 101 |
| | Salmonella enterica subsp. arizonae | NCTC10047 | 59203 | GCA_900635675.1 | 146 |
| | Serratia marcescens | 11/2010 | 615 | GCA_013426155.1 | 105 |
| | Serratia marcescens | SM39 | 1334564 | GCA_000828775.1 | 101 |
| | Verrucomicrobia bacterium | HZ-65 | 2026799 | GCA_002310495.1 | 282 |
| | Verrucomicrobia bacterium | IMCC26134 | 1637999 | GCA_000972765.1 | 181 |
| | Xanthomonas citri subsp. citri | A306 | 1308541 | GCA_000816885.1 | 171 |
| | Xanthomonas citri subsp. citri | Aw12879 | 1137651 | GCA_000349225.1 | 170 |

3 The Binary CAZyme/non-CAZyme classification

The assignment of CAZy family annotations to a protein by a CAZyme classifier identifies the protein as a CAZyme. If no CAZy family annotations are assigned to a protein by a CAZyme classifier, the protein is identified as a non-CAZyme. This section of the notebook evaluates the performance of the CAZyme classifiers dbCAN (which incorporates HMMER, Hotpep and DIAMOND), CUPP and eCAMI for this binary CAZyme/non-CAZyme classification.

3.1 Summary statistics

For each classifier, for each test set, the specificity, sensitivity (recall), precision, F1-score and accuracy were calculated. The mean of each statistical parameter was calculated for each classifier across all test sets, to represent the overall performance of each CAZyme classifier. These results are presented in table 3.1. The performance of the classifiers for each statistical parameter is discussed in a separate section below.
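The notebook computed these statistics with sklearn; the per-test-set definitions follow directly from the confusion matrix. A pure-Python sketch (function and key names are illustrative):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Per-test-set statistics for the CAZyme/non-CAZyme evaluation,
    computed from confusion-matrix counts (pure-Python equivalents of
    the sklearn metrics used in the notebook)."""
    specificity = tn / (tn + fp)          # true negative rate
    recall = tp / (tp + fn)               # sensitivity / true positive rate
    precision = tp / (tp + fp)            # fraction of positive calls that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "specificity": specificity,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }
```

For a test set of 100 CAZymes and 100 non-CAZymes where a tool correctly calls 90 CAZymes and 98 non-CAZymes, `binary_metrics(90, 98, 2, 10)` gives a specificity of 0.98, recall of 0.90 and accuracy of 0.94.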

Table 3.1: Overall performance of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes

| Prediction Tool | Mean Specificity | Specificity SD | Mean Recall | Recall SD | Mean Precision | Precision SD | Mean F1-score | F1-score SD | Mean Accuracy | Accuracy SD |
|---|---|---|---|---|---|---|---|---|---|---|
| dbCAN | 0.9820 | 0.0471 | 0.8979 | 0.1210 | 0.9824 | 0.0436 | 0.9323 | 0.0915 | 0.9399 | 0.0654 |
| dbCAN-HMMER | 0.9836 | 0.0448 | 0.8777 | 0.0849 | 0.9833 | 0.0433 | 0.9245 | 0.0673 | 0.9306 | 0.0504 |
| dbCAN-Hotpep | 0.9766 | 0.0497 | 0.8174 | 0.1312 | 0.9752 | 0.0481 | 0.8823 | 0.0919 | 0.8970 | 0.0679 |
| dbCAN-DIAMOND | 0.9833 | 0.0389 | 0.8857 | 0.1576 | 0.9829 | 0.0396 | 0.9215 | 0.1246 | 0.9345 | 0.0816 |
| CUPP | 0.9820 | 0.0479 | 0.8541 | 0.0806 | 0.9821 | 0.0438 | 0.9108 | 0.0531 | 0.9181 | 0.0447 |
| eCAMI | 0.9766 | 0.0489 | 0.8580 | 0.1346 | 0.9766 | 0.0455 | 0.9062 | 0.0910 | 0.9173 | 0.0680 |

(SD: standard deviation across all test sets.)

3.1.1 Specificity

Specificity is the proportion of known negatives (in this case known non-CAZymes) which are correctly classified as negatives/non-CAZymes. Figure 3.1 is a graphical representation of the results calculated in table 3.1.

Figure 3.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of specificities across all test sets.

All tools showed a low probability of misclassifying non-CAZymes as CAZymes, indicating that CAZyme predictions by these tools should be treated as confident predictions. The weakest tools in this category were the k-mer methods Hotpep and eCAMI. The third k-mer method, CUPP, showed a similar performance to dbCAN.

3.1.2 Sensitivity

Sensitivity (recall) is the proportion of known CAZymes that are correctly identified as CAZymes. Figure 3.2 shows the sensitivity for each test set for each classifier.

Figure 3.2: One-dimensional scatter plot of sensitivity (recall) of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of sensitivities across all test sets.

DIAMOND and dbCAN demonstrated the strongest performances, with the highest mean sensitivities and highest quartile values. Hotpep showed the weakest performance, with the lowest mean and greatest interquartile range, indicating poor consistency in performance.

The mean sensitivity across all test sets for eCAMI (0.8580 to 4 d.p.) was greater than that of CUPP (0.8541 to 4 d.p.). However, the standard deviation for eCAMI was greater than CUPP, and so was the interquartile range. Therefore, eCAMI potentially has a higher probability of correctly identifying a known CAZyme as a CAZyme, but the range in performance is less consistent than CUPP.

The sensitivities of all the classifiers indicate that it is unlikely they will identify the complete CAZome of a candidate species, although they will identify the majority of CAZymes within the CAZome. dbCAN and DIAMOND will identify at least 90% of CAZymes within the CAZome; eCAMI and Hotpep will tend to identify 80-90% of a species' CAZome.

3.1.3 Precision

Precision is the proportion of positive predictions by the classifiers that are correct. In this case, precision represents the fraction of CAZyme predictions by the classifiers that are correct, specifically the proportion of predicted CAZymes that are known CAZymes. Figure 3.3 depicts the precision of each classifier for each test set (table 3.1).

Figure 3.3: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of precisions across all test sets.

All tools demonstrated that the vast majority of CAZyme (positive) predictions are correct, and that the tools generate few false positives. This indicates that high confidence can be assigned to CAZyme (positive) predictions generated by the CAZyme classifiers; however, taking recall into consideration, the classifiers will not identify all CAZymes within a CAZome.

Again, all tools demonstrated a similar strength in performance, except the k-mer based methods Hotpep and eCAMI. Based upon the standard deviation of precision scores across all test sets, Hotpep and eCAMI are highly likely to generate a larger proportion of false positives than the other CAZyme classifiers evaluated, with approximately 3-5% of CAZyme predictions from Hotpep and eCAMI being false positives.

3.1.4 F1-score

The F1-score is the harmonic mean of recall and precision and provides an indication of the overall performance of a tool, 0 being the worst and 1 being the best performance. Figure 3.4 shows the F1-score from each test set, for each classifier.

Figure 3.4: One-dimensional scatter plot of the F1-score of CAZyme and non-CAZyme predictions per test set, overlaying boxplot of interquartile ranges to represent the distribution of F1-scores across all test sets.

dbCAN and DIAMOND had the highest quartile values, but HMMER produced a higher mean F1-score and a smaller interquartile range than DIAMOND. Therefore, dbCAN, HMMER and DIAMOND demonstrated the strongest performances.

Hotpep showed the weakest performance, with the lowest mean F1-score and greatest interquartile range, indicating poor performance consistency.

CUPP demonstrated a stronger performance than eCAMI, with a higher mean F1-score, smaller standard deviation and smaller interquartile range, indicating that in general CUPP will produce a higher F1-score and performs more consistently than eCAMI.

3.1.5 Accuracy

Accuracy (calculated as (TP + TN) / (TP + TN + FP + FN)) provides an indication of the overall performance of the classifiers, as a measure of the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 3.5 is a plot of the respective data from table 3.1.

Figure 3.5: One-dimensional scatter plot of accuracies of CAZyme and non-CAZyme predictions per test set, overlaying a boxplot of interquartile ranges to represent the distribution of accuracies across all test sets.

Similar to the F1-scores, dbCAN and DIAMOND showed the best performance. Arguably, Hotpep demonstrated the worst performance, although it was similar to that of the other k-mer methods, CUPP and eCAMI. This suggests that, alone, the k-mer methods are not as effective at differentiating between CAZymes and non-CAZymes as methods that rely on more global sequence similarity, such as HMMER and DIAMOND.

3.2 Expected Range of Accuracy

The statistics evaluated above provide an idea of the general performance of the tools, but they do not indicate the expected range of performance. Specifically, the data do not provide a clear picture of the best and worst performance a user can expect when using these tools.

To compare the expected typical range in accuracies for each classifier, 6 test sets (identified by their source genomic assemblies) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times each, and the accuracy was calculated for each bootstrap sample. The accuracies of the bootstrap samples for each classifier were plotted on stacked histograms, shown in figure 3.6.
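The bootstrap procedure described above can be sketched as follows (an illustrative simplification using the standard library, not pyrewton's implementation):

```python
import random


def bootstrap_accuracies(truth, preds, n_boot=100, seed=0):
    """Resample the per-protein (truth, prediction) pairs with
    replacement n_boot times, recording the accuracy of each resample.
    The spread of the returned accuracies gives the expected range of
    performance for one classifier on one test set."""
    rng = random.Random(seed)
    pairs = list(zip(truth, preds))
    accuracies = []
    for _ in range(n_boot):
        # One bootstrap sample: same size as the test set, drawn with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        accuracies.append(sum(t == p for t, p in sample) / len(sample))
    return accuracies
```

Plotting these values per classifier and test set (e.g. as a stacked histogram) gives figures like figure 3.6.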

Figure 3.6: Stacked histograms of bootstrap sample accuracies of CAZyme classifiers’ differentiation between CAZymes and non-CAZymes. 6 test sets (identified by their source genomic assembly) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times. The accuracy of each of the 600 bootstrap samples per test set were plotted as a stacked histogram.

3.3 Investigation of Non-CAZymes classified as CAZymes (False positives)

Few of the known non-CAZymes were classified as CAZymes by the CAZyme classifiers. The non-CAZymes may have been classified as CAZymes because of:
- a very high sequence similarity between the non-CAZyme and known CAZymes
- CAZy incorrectly classifying the non-CAZyme as a CAZyme
- exclusion of a protein from CAZy not being a strong enough criterion to definitively define a protein as a non-CAZyme

The latter two points may be true if all 6 classifiers classify the non-CAZyme as a CAZyme.

To explore the first point, the BLAST Score Ratios (BSRs) of all false positive CAZyme predictions were plotted on a boxplot.
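For reference, one common definition of the BLAST Score Ratio normalises the bit score of a query aligned against a subject by the query's self-alignment bit score; a sketch of that calculation is below (this is the standard definition, not code taken from pyrewton):

```python
def blast_score_ratio(query_vs_subject: float, query_vs_self: float) -> float:
    """BLAST Score Ratio (BSR): the bit score of the query aligned
    against a subject sequence, divided by the bit score of the query
    aligned against itself. Values near 1 indicate near-identical
    sequences; values near 0 indicate little or no similarity."""
    return query_vs_subject / query_vs_self
```

A false-positive non-CAZyme with a BSR near 1 against a known CAZyme would support the sequence-similarity explanation; low BSRs argue against it.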

Figure 3.7: BLAST Score Ratio of all CAZy classified non-CAZymes falsely classified as CAZymes by at least 5 of the CAZyme prediction tools dbCAN, HMMER, Hotpep, DIAMOND, CUPP and eCAMI

The figure above demonstrates that there is no correlation between a high sequence similarity to known CAZymes and a protein excluded from CAZy being classified as a CAZyme by the prediction tools.

This leaves the latter two potential causes of the false positive classification of proteins not included in CAZy:
- CAZy incorrectly classifying the non-CAZyme as a CAZyme
- Exclusion of a protein from CAZy not being a strong enough criterion to definitively define a protein as a non-CAZyme

These two points both allude to the concept that CAZy may be the most comprehensive CAZyme database, but it is not exhaustive. This is very likely because our sequencing capacity far exceeds our capacity to accurately annotate protein functions; therefore, it is very likely there are CAZymes that have not yet been analysed by CAZy. Consequently, exclusion from CAZy should not necessarily be interpreted as definitive identification of a non-CAZyme.

…add in table to potential of ‘non-CAZymes’ being CAZymes… … … …

3.4 Conclusions of binary CAZyme and non-CAZyme classifications

4 CAZy Class Prediction

The CAZyme prediction tools predict the CAZy family annotations of CAZymes. CAZy families are catalogued into one of six CAZy classes (definitions taken from www.cazy.org):

  • Glycoside Hydrolases (GHs): hydrolysis and/or rearrangement of glycosidic bonds
  • GlycosylTransferases (GTs): formation of glycosidic bonds
  • Polysaccharide Lyases (PLs): non-hydrolytic cleavage of glycosidic bonds
  • Carbohydrate Esterases (CEs): hydrolysis of carbohydrate esters
  • Auxiliary Activities (AAs): redox enzymes that act in conjunction with CAZymes
  • Carbohydrate-Binding Modules (CBMs): adhesion to carbohydrates

It may be that a prediction tool is unable to accurately predict the specific CAZy family of a protein but can accurately predict the correct CAZy class. This section of the notebook evaluates the performance of each of the prediction tools in predicting the correct CAZy class, irrespective of whether the child CAZy family prediction is correct. No previous evaluations of the CAZyme prediction tools have evaluated the performance of the tools at the level of CAZy class prediction.

4.1 Multilabel CAZy Class Prediction Performance

A single CAZyme can be included in multiple CAZy classes, leading to the multilabel classification of CAZymes. To address this and evaluate the multilabel classification of CAZy classes, the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.

The RI is the measure of agreement across all potential classifications of a protein. The RI ranges from 0 (no correct annotations) to 1 (all annotations correct) (figure 4.1). The ARI is the RI adjusted for chance: 0 is equivalent to assigning the CAZy class annotations randomly, -1 indicates the annotations are systematically handed out incorrectly, and 1 indicates the annotations are all correct (figure 4.2).
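The RI can be computed with sklearn (e.g. sklearn.metrics.rand_score and adjusted_rand_score). A pure-Python sketch of the pairwise-agreement definition follows; note this is a simplification that treats each protein as carrying a single label, whereas the notebook handles full multilabel annotations:

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Rand Index between two label assignments: the fraction of
    protein pairs on which the assignments agree, i.e. both place the
    pair in the same group, or both place it in different groups."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # pair counted when both assignments agree
    return agree / len(pairs)
```

Identical assignments give an RI of 1.0; the ARI additionally subtracts the agreement expected by chance, which is why a random assignment scores near 0 on the ARI but can still score well above 0 on the raw RI.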

Figure 4.1: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Figure 4.2: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

4.2 CAZy Class Prediction Performance

This section evaluates the performance of each prediction tool for each CAZy class, independent of the performance for the other CAZy classes. True negative non-CAZyme predictions were excluded.


Figure 4.3: Proportional area plot shaded to represent the distribution of Fbeta-score for each CAZy class for each test set parsed by CAZyme prediction tools.

4.3 Prediction of Each CAZy Class

4.3.1 Glycoside Hydrolases Class Prediction Performance

4.3.2 GlycosylTransferases Class Prediction Performance

4.3.3 Polysaccharide Lyases Class Prediction Performance

4.3.4 Carbohydrate Esterases Class Prediction Performance

4.3.5 Auxiliary Activities Class Prediction Performance

4.3.6 Carbohydrate-Binding Modules Class Prediction Performance

5 CAZy Family Prediction

5.1 Multilabel CAZy Family Prediction Performance

A single CAZyme can be included in multiple CAZy families, from multiple different CAZy classes, resulting in the multilabel classification of CAZymes. To address this and evaluate the multilabel classification of CAZy families the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.

The RI is the measure of agreement across all potential classifications of a protein. The RI ranges from 0 (no correct annotations) to 1 (all annotations correct) (figure 5.1). The ARI is the RI adjusted for chance: 0 is equivalent to assigning the CAZy family annotations randomly, -1 indicates the annotations are systematically handed out incorrectly, and 1 indicates the annotations are all correct (figure 5.2).

Figure 5.1: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.

Figure 5.2: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.

5.2 CAZy Family Prediction Performance

This section evaluates the performance of each prediction tool for each CAZy family, independent of the performance for the other CAZy families. True negative non-CAZyme predictions were excluded.


Figure 5.3: Proportional area plot shaded to represent the distribution of Fbeta-score for each CAZy family for each test set parsed by CAZyme prediction tools.

5.3 Prediction of Each CAZy Family

To evaluate the performance of each prediction tool for each CAZy family, the families were grouped by their CAZy class, and the specificity of each prediction tool was plotted against the sensitivity for each CAZy family. To find families for which most tools performed poorly (defined as an Fbeta-score of less than 0.75), heatmaps were produced of the Fbeta-scores for CAZy families for which at least 3 tools produced an Fbeta-score of less than 0.75, also comparing the number of CAZyme records in the family in CAZy and the number of family members included across all test sets.
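For reference, the Fbeta-score generalises the F1-score by weighting precision against recall via beta (sklearn exposes this as sklearn.metrics.fbeta_score). The beta value used in the notebook is not stated in this section; the default of 0.5 below is purely illustrative:

```python
def fbeta_score(precision: float, recall: float, beta: float = 0.5) -> float:
    """Fbeta-score from precision and recall. beta < 1 weights
    precision more heavily; beta > 1 weights recall more heavily;
    beta = 1 recovers the F1-score (harmonic mean)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the F1-score used in section 3; the per-family poor-performance threshold above (Fbeta-score < 0.75) is applied to these values.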

5.3.1 Predicting Families From Glycoside Hydrolases

Figure 5.4 shows the specificity and sensitivity of each prediction tool for each Glycoside Hydrolase family. All prediction tools showed a very strong performance for specificity, with no tool producing a specificity score less than 0.995.

dbCAN had the most families with a sensitivity greater than or equal to 0.9 (99 families), closely followed by HMMER and CUPP (with 97 families each). However, dbCAN and HMMER had more families with a sensitivity greater than or equal to 0.75 than CUPP (114, 113 and 103 families respectively). Therefore, dbCAN and HMMER showed the strongest performances for GH families.

dbCAN-Hotpep, CUPP and eCAMI showed the weakest performances, with the most families producing a sensitivity score less than 0.75. However, eCAMI and dbCAN-Hotpep had the most families with a specificity score less than 0.995, although the specificity scores from Hotpep were lower than those of eCAMI. Therefore, dbCAN-Hotpep showed the weakest performance, although overall the performances of the tools were similar. dbCAN demonstrated the strongest performance, with the most families with a sensitivity greater than 0.75 and a specificity of 1.

Figure 5.4: Scatter plot of specificity against sensitivity for each CAZy family within the Glycoside Hydrolase class. Hover cursor over each point to see the specific sensitivity and specificity.

5.3.1.1 Identify potentially difficult to classify GH families

To identify GH CAZy families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.5. It was no surprise that GH0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as GHs but for which it could not determine the CAZy family annotation; therefore, GH0 includes members from multiple different CAZy families. Thus, GH0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for GH0 is typically lower than for other CAZy families, for all prediction tools. Families GH163-GH170 are not included in the models within the prediction tools (except HMMER, which does include GH163); therefore, these tools cannot predict members of these families.

The remaining families contained very small sample sizes of fewer than 10 proteins. Thus, the chance of producing a low Fbeta-score is significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score.

Figure 5.5: Heatmap of Glycoside Hydrolases families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

5.3.2 Predicting Families From GlycosylTransferases

Figure 5.6 shows the specificity against sensitivity for each CAZyme prediction tool for each GT family. All tools showed an extremely strong performance for specificity; no tool produced a specificity of less than 0.9985.

The k-mer based methods dbCAN-Hotpep, CUPP and eCAMI showed the weakest performances, because they had the most families with a sensitivity of less than 0.75.

HMMER had the most families with a sensitivity equal to or greater than 0.9 (51 GT families); however, DIAMOND had the most families with a sensitivity equal to or greater than 0.75, and had the fewest families with a sensitivity less than 0.75 (58 and 11 GT families respectively). Therefore, HMMER and DIAMOND both showed the strongest performances.

dbCAN had the most families with a specificity greater than 0.99975, but the differences in specificity scores were so small that it was not possible to differentiate the performance of the prediction tools by specificity. However, DIAMOND had the most families with a sensitivity greater than 0.75, and thus showed the strongest sensitivity performance of all the prediction tools for predicting GT families.

Figure 5.6: Scatter plot of specificity against sensitivity for each CAZy family within the GlycosylTransferases class. Hover cursor over each point to see the specific sensitivity and specificity.

5.3.2.1 Identify potentially difficult to classify GT families

To identify GT CAZy families that most prediction tools classified poorly, families for which at least three prediction tools produced an Fβ-score of less than 0.75 were identified, as shown in figure 5.7. It was no surprise that GT0 was included, because CAZy classifies this family as ‘unclassified’. The family includes CAZymes that CAZy has classified as GTs but for which it cannot determine the CAZy family annotation; therefore, GT0 includes members from multiple different CAZy families. GT0 thus has a higher sequence diversity, making accurate modelling of the family more difficult. Consequently, the performance for GT0 is typically lower than for other CAZy families for all prediction tools.
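The flagging rule, at least three tools below an Fβ-score of 0.75, can be sketched in a few lines. The scores below are hypothetical placeholders, not values from this evaluation:

```python
# Flag potentially difficult-to-classify families: a family is flagged when
# at least three prediction tools score an F(beta)-score below 0.75.
# All scores here are hypothetical, for illustration only.
fbeta_scores = {
    "GT0": {"HMMER": 0.40, "DIAMOND": 0.35, "Hotpep": 0.20, "CUPP": 0.30, "eCAMI": 0.25},
    "GT2": {"HMMER": 0.95, "DIAMOND": 0.92, "Hotpep": 0.70, "CUPP": 0.90, "eCAMI": 0.88},
}

THRESHOLD = 0.75
flagged = [
    family
    for family, tool_scores in fbeta_scores.items()
    if sum(score < THRESHOLD for score in tool_scores.values()) >= 3
]
# Only families where >= 3 tools fall below the threshold are retained.
```

With these placeholder scores only GT0 is flagged: GT2 has a single tool below the threshold, which the majority rule deliberately tolerates.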

Families GT109–GT113 are not included in the models within the prediction tools; therefore, the tools cannot predict members of these families.

The remaining families (except GT29 and GT31) had very small sample sizes of fewer than 10 proteins. With so few proteins, the odds of producing a low Fβ-score are significantly increased, and the probability of producing a low Fβ-score is much greater than that of producing a high Fβ-score.
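The sensitivity of the Fβ-score to sample size can be illustrated with a short worked example, assuming β = 1 purely for illustration (the evaluation's β is not restated here) and a single false negative with no false positives:

```python
# Illustration (not evaluation data): how a single missed protein affects
# the F-score for a small family versus a large one. beta = 1 is assumed
# here purely for simplicity.
def fbeta(tp, fp, fn, beta=1.0):
    """F(beta)-score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# One false negative out of 5 family members vs one out of 50:
small = fbeta(tp=4, fp=0, fn=1)    # 5-member family: recall drops to 0.8
large = fbeta(tp=49, fp=0, fn=1)   # 50-member family: recall stays at 0.98
```

A single misclassification costs the 5-member family roughly ten times as much recall as the 50-member family, so a low score for a tiny family says more about the sample than about the tool.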

GT29 and GT31 had 30 and 20 family members included across all test sets, respectively. These sample sizes are suitable for producing an accurate representation of the prediction tools’ performances for these families. A potential reason for the poor performance of most of the prediction tools for these families is that GT29 and GT31 contain a greater sequence diversity than families for which most prediction tools performed well (Fβ-score greater than 0.75). Greater sequence diversity within a family makes accurately modelling the family, and thus predicting family members, more difficult.


Figure 5.7: Heatmap of GlycosylTransferase families for which at least three CAZyme prediction tools produced a poor performance, defined as an Fβ-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

5.3.3 Predicting Families From Polysaccharide Lyases

Figure 5.8 plots the specificity against sensitivity from each CAZyme prediction tool for each PL family. All prediction tools showed a very strong specificity performance, with no family scoring less than 0.9997. The differences between specificity scores were so small that the performances of the prediction tools could not be differentiated by specificity.

The spread (variation) in sensitivity scores for PL families was greater than that for GH and GT families. HMMER showed the strongest performance because it had the most families with a sensitivity score of greater than 0.9 (16 PL families). Hotpep and eCAMI had the most families with a sensitivity of less than 0.75 (9 families each); however, eCAMI had fewer families with a sensitivity of greater than or equal to 0.9 than Hotpep (6 and 10 families, respectively); therefore, eCAMI showed the weakest performance.

Figure 5.8: Scatter plot of specificity against sensitivity for each CAZy family within the Polysaccharide Lyases class.

5.3.3.1 Identify potentially difficult to classify PL families

To identify PL CAZy families that most prediction tools classified poorly, families for which at least three prediction tools produced an Fβ-score of less than 0.75 were identified, as shown in figure 5.9. It was no surprise that PL0 was included, because CAZy classifies this family as ‘unclassified’. The family includes CAZymes that CAZy has classified as PLs but for which it cannot determine the CAZy family annotation; therefore, PL0 includes members from multiple different CAZy families. PL0 thus has a higher sequence diversity, making accurate modelling of the family more difficult. Consequently, the performance for PL0 is typically lower than for other CAZy families for all prediction tools.

CUPP and eCAMI do not include any PL families newer than PL28; therefore, they could not predict members of PL31, PL33 and PL38. dbCAN, and its incorporated tools HMMER, Hotpep and DIAMOND, are based upon a more recent version of CAZy and do include PL31 and PL33, but they do not include PL38.

The remaining families had very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fβ-score are significantly increased, and the probability of producing a low Fβ-score is much greater than that of producing a high Fβ-score. Five members of PL17 were included across the test sets, and three prediction tools produced an Fβ-score greater than 0.88, suggesting that the tools may perform well against PL17 family members but that the limited sample size favours producing a lower Fβ-score.


Figure 5.9: Heatmap of Polysaccharide Lyase families for which at least three CAZyme prediction tools produced a poor performance, defined as an Fβ-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

5.3.4 Predicting Families From Carbohydrate Esterases

Figure 5.10 plots the specificity against sensitivity from each CAZyme prediction tool for each CE family.

All prediction tools showed a very strong specificity performance, with no family scoring less than 0.998. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.

dbCAN, HMMER and Hotpep showed the strongest performances, with the most families scoring a sensitivity equal to or greater than 0.75. dbCAN and HMMER showed slightly stronger performances than Hotpep, with more families with a sensitivity equal to or greater than 0.9. In contrast to its strong performances for the previous classes, DIAMOND showed the weakest performance, with the most families with a sensitivity of less than 0.75, although eCAMI had the fewest families with a sensitivity equal to or greater than 0.9.

Figure 5.10: Scatter plot of specificity against sensitivity for each CAZy family within the Carbohydrate Esterases class.

5.3.4.1 Identify potentially difficult to classify CE families

To identify CE CAZy families that most prediction tools classified poorly, families for which at least three prediction tools produced an Fβ-score of less than 0.75 were identified, as shown in figure 5.11. It was no surprise that CE0 was included, because CAZy classifies this family as ‘unclassified’. The family includes CAZymes that CAZy has classified as CEs but for which it cannot determine the CAZy family annotation; therefore, CE0 includes members from multiple different CAZy families. CE0 thus has a higher sequence diversity, making accurate modelling of the family more difficult. Consequently, the performance for CE0 is typically lower than for other CAZy families for all prediction tools.

None of the prediction tools include the CAZy family CE18; therefore, none of the tools could predict any of the proteins belonging to CE18, resulting in poor performances.

CE16 is included in all the prediction tools, but only three family members were included across all test sets. A sample size this small significantly increases the probability of producing a low Fβ-score. Therefore, the prediction tools are unlikely to truly perform poorly for CE16; the sample size was more influential in producing a low Fβ-score.


Figure 5.11: Heatmap of Carbohydrate Esterase families for which at least three CAZyme prediction tools produced a poor performance, defined as an Fβ-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

5.3.5 Predicting Families From Auxiliary Activities

Figure 5.12 plots the specificity against sensitivity from each CAZyme prediction tool for each AA family.

All prediction tools showed a very strong specificity performance, with no family scoring less than 0.997. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.

HMMER had the most families with a sensitivity score equal to or greater than 0.9, and thus showed the strongest performance. DIAMOND, CUPP and eCAMI showed the weakest performances, with the most families scoring a sensitivity of less than 0.75.

CUPP and dbCAN demonstrated similarly strong performances, each with 8 AA families with a sensitivity equal to or greater than 0.9. However, dbCAN had more families with a sensitivity equal to or greater than 0.75 than CUPP; therefore, dbCAN showed a slightly stronger performance than CUPP.

Figure 5.12: Scatter plot of specificity against sensitivity for each CAZy family within the Auxiliary Activities class.

5.3.5.1 Identify potentially difficult to classify AA families

To identify AA CAZy families that most prediction tools classified poorly, families for which at least three prediction tools produced an Fβ-score of less than 0.75 were identified, as shown in figure 5.13. It was no surprise that AA0 was included, because CAZy classifies this family as ‘unclassified’. The family includes CAZymes that CAZy has classified as AAs but for which it cannot determine the CAZy family annotation; therefore, AA0 includes members from multiple different CAZy families. AA0 thus has a higher sequence diversity, making accurate modelling of the family more difficult. Consequently, the performance for AA0 is typically lower than for other CAZy families for all prediction tools.

The remaining families had very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fβ-score are significantly increased, and the probability of producing a low Fβ-score is much greater than that of producing a high Fβ-score.


Figure 5.13: Heatmap of Auxiliary Activity families for which at least three CAZyme prediction tools produced a poor performance, defined as an Fβ-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

5.3.6 Predicting Families From Carbohydrate-Binding Modules

Figure 5.14 plots the specificity against sensitivity from each CAZyme prediction tool for each CBM family.

CUPP predicted no members of any CBM family. CUPP was invoked three times, and the output files were searched for predictions of CBM families, but none were found. Therefore, CUPP showed the weakest performance for CBM families.
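A check of this kind, confirming the absence of CBM annotations in a tool's output, can be sketched as below. The file layout assumed here (one tab-separated annotation per line) is a hypothetical illustration, not CUPP's documented output format:

```python
# Hypothetical sketch: scan a prediction tool's output file for any line
# naming a CBM family (e.g. "CBM20"). The tab-separated layout used in the
# test below is illustrative only.
from pathlib import Path

def contains_cbm_predictions(output_file: Path) -> bool:
    """Return True if any annotation line in the file names a CBM family."""
    with output_file.open() as handle:
        return any("CBM" in line for line in handle)
```

Running such a check over each of the three CUPP invocations would confirm whether the absence of CBM predictions is consistent across runs.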

dbCAN and DIAMOND had the most families with a sensitivity greater than or equal to 0.9 (35 families each), and had similar numbers of families with a sensitivity greater than 0.75 (44 and 43 families respectively). Therefore, both dbCAN and DIAMOND showed the strongest performance.

Hotpep showed a slightly stronger performance than eCAMI, with more families with a sensitivity greater than or equal to 0.9, and greater than or equal to 0.75.

All prediction tools (except CUPP) showed a very strong specificity performance, with no family scoring less than 0.98. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.


Figure 5.14: Scatter plot of specificity against sensitivity for each CAZy family within the Carbohydrate-Binding Modules class.


5.3.6.1 Identify potentially difficult to classify CBM families

To identify CBM CAZy families that most prediction tools classified poorly, families for which at least three prediction tools produced an Fβ-score of less than 0.75 were identified, as shown in figure 5.15. It was no surprise that CBM0 was included, because CAZy classifies this family as ‘unclassified’. The family includes CAZymes that CAZy has classified as CBMs but for which it cannot determine the CAZy family annotation; therefore, CBM0 includes members from multiple different CAZy families. CBM0 thus has a higher sequence diversity, making accurate modelling of the family more difficult. Consequently, the performance for CBM0 is typically lower than for other CAZy families for all prediction tools.

The remaining families had very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fβ-score are significantly increased, and the probability of producing a low Fβ-score is much greater than that of producing a high Fβ-score.


Figure 5.15: Heatmap of Carbohydrate-Binding Module families for which at least three CAZyme prediction tools produced a poor performance, defined as an Fβ-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.

6 Final conclusions